# Multilingual Visual Question Answering
Eurovlm 9B Preview
Apache-2.0
EuroVLM-9B-Preview is a multimodal vision-language model based on the long-context version of EuroLLM-9B, supporting multiple languages and visual tasks. It is currently a preview release.
Image-to-Text
Transformers Supports Multiple Languages

utter-project
156
2
Erax VL 7B V1.5
Apache-2.0
EraX-VL-7B-V1.5 is a powerful multimodal model specializing in Optical Character Recognition (OCR) and Visual Question Answering (VQA), excelling in multilingual environments with particular expertise in Vietnamese.
Image-to-Text
Transformers Supports Multiple Languages

mxw1998
26
0
Trillion LLaVA 7B
Apache-2.0
Trillion-LLaVA-7B is a vision-language model (VLM) capable of understanding images, developed based on the Trillion-7B-preview foundation model.
Image-to-Text
Transformers Supports Multiple Languages

trillionlabs
199
8
Internvl3 8B 6bit
Other
InternVL3-8B-6bit is a vision-language model converted to MLX format, supporting multilingual image-text-to-text tasks.
Image-to-Text
Transformers Other

mlx-community
70
1
Colqwen2.5 3b Multilingual V1.0
MIT
A multilingual visual retrieval model based on Qwen2.5-VL-3B-Instruct and the ColBERT strategy, supporting dynamic input image resolution and multilingual document retrieval.
Text-to-Image Supports Multiple Languages
tsystems
13.29k
8
Llama 3.2 11b Vision R1 Distill
Llama 3.2-Vision is a multimodal large language model developed by Meta, supporting image and text inputs, optimized for visual recognition, image reasoning, and description tasks.
Image-to-Text
Transformers Supports Multiple Languages

bababababooey
29
1
Centurio Aya
Centurio is an open-source multilingual large vision-language model supporting 100 languages and capable of handling both image-to-text and text-to-text tasks.
Image-to-Text
Transformers Supports Multiple Languages

WueNLP
29
4
Paligemma2 10b Mix 448
PaliGemma 2 is a vision-language model based on Gemma 2 that takes image and text inputs and generates text outputs, suitable for a range of vision-language tasks.
Image-to-Text
Transformers

google
31.63k
25
Mblip Bloomz 7b
MIT
mBLIP is a multilingual vision-language model based on the BLIP-2 architecture, supporting image caption generation and visual question answering tasks in 96 languages.
Image-to-Text
Transformers Supports Multiple Languages

Gregor
21
1
Mblip Mt0 Xl
MIT
mBLIP is a multilingual vision-language model based on the BLIP-2 architecture, supporting image caption generation and visual question answering tasks in 96 languages.
Image-to-Text
Transformers Supports Multiple Languages

Gregor
374
14